Configuring the Python Setup

The first thing you will need to do is configure the Python setup for reticulate. Reticulate ships with support for Miniconda, and you will need to make sure it points at the correct conda installation and environment. When you run install.packages("reticulate") and load the package, it can create a Miniconda installation for you (the r-miniconda entry listed below).
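
A minimal sketch of that initial setup, assuming Miniconda has not been installed yet (install_miniconda() only needs to be run once):

install.packages("reticulate")
library(reticulate)
#install_miniconda() # Creates the r-miniconda installation that conda_list() reports below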

conda_list()
##                    name
## 1           r-miniconda
## 2                pillow
## 3          r-reticulate
## 4 r-reticulate-gary-env
##                                                                                   python
## 1                              C:\\Users\\garyh\\AppData\\Local\\r-miniconda\\python.exe
## 2                C:\\Users\\garyh\\AppData\\Local\\r-miniconda\\envs\\pillow\\python.exe
## 3          C:\\Users\\garyh\\AppData\\Local\\r-miniconda\\envs\\r-reticulate\\python.exe
## 4 C:\\Users\\garyh\\AppData\\Local\\r-miniconda\\envs\\r-reticulate-gary-env\\python.exe
#use_condaenv("anaconda3")

The approach I prefer is to create my own conda environment to store all the relevant packages and supporting information I need.

Creating your own conda environment

To create your own environment, it is as simple as passing a new environment name, as shown below:

my_env <- "r-reticulate-gary-env"
#conda_create(my_env)

The next step would be to install the relevant Python packages into the conda environment. The reason we want to use reticulate is to access all the cool packages that we do not have access to in native R.

Installing Python packages

The next step of the process is to install the relevant packages that you may require to work with in R:

#py_install("pandas",envname = my_env) #Python's data frame library
#py_install("numpy", envname = my_env) #Python's array library
#py_install("seaborn", envname = my_env) #Python's visualisation library
#py_install("scikit-learn",envname = my_env) #Python's Machine Learning library
#py_install("matplotlib", envname = my_env) #Python's core visualisation library

The next step is to use the new Python environment we have just created to work with reticulate and R together.

use_condaenv(my_env)
conda_version()
## [1] "conda 4.9.0"
conda_list()
##                    name
## 1           r-miniconda
## 2                pillow
## 3          r-reticulate
## 4 r-reticulate-gary-env
##                                                                                   python
## 1                              C:\\Users\\garyh\\AppData\\Local\\r-miniconda\\python.exe
## 2                C:\\Users\\garyh\\AppData\\Local\\r-miniconda\\envs\\pillow\\python.exe
## 3          C:\\Users\\garyh\\AppData\\Local\\r-miniconda\\envs\\r-reticulate\\python.exe
## 4 C:\\Users\\garyh\\AppData\\Local\\r-miniconda\\envs\\r-reticulate-gary-env\\python.exe

Importing Python packages in reticulate style

For Python users, this bit will look a little unfamiliar, as we are used to declaring imports in the form from mypackage import submodule. Reticulate needs each Python module to be imported with import() and stored as an R object (variable):

#Import Python modules as R objects
numpy <- import("numpy")
## Warning: Python 'C:\Users\garyh\AppData\Local\r-miniconda\envs\r-
## reticulate-gary-env\python.exe' was requested but 'C:/Users/garyh/AppData/
## Local/r-miniconda/envs/r-reticulate/python.exe' was loaded instead (see
## reticulate::py_config() for more information)
pandas <- import("pandas")

# Import submodules from scikit-learn
sl_model_selection <- import("sklearn.model_selection")
skl <- import("sklearn")
skl_ensemble <- import("sklearn.ensemble")
skl_pipeline <- import("sklearn.pipeline")
skl_metrics <- import("sklearn.metrics")
skl_externals <- import("sklearn.externals")
skl_lm <- import("sklearn.linear_model")

# Import visualisation libraries
sns <- import('seaborn')
plt <- import('matplotlib.pyplot')

The setup is now complete. The next section looks at some basics, before jumping into how to use R and Python together: passing a data frame from R to the Python ML packages, creating some Python visuals, passing the results back to R, and then out to an external Python file again.

Functions from Python in Reticulate

Functions in Python are defined with def, and to use a Python function in R you follow the steps below:

py_run_string("def square_root(x):
                value = x * 0.5
                return(value)")

At first you will think this did absolutely nothing, but the function is simply hidden in the embedded Python session. To access Python objects from R you use the py object, as shown below:

py$square_root(10)
## [1] 3.162278

The py object exposes the Python objects that have been made available to R. Now I can access my custom square root function and pass a value to it; this is my preferred way. Another way this can be achieved is with an eval-style statement:

py_eval("square_root(10)")
## [1] 5
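
As a side note, if the function lives in its own .py file, reticulate's source_python() will expose it directly as an R function. A minimal sketch, assuming a hypothetical file Scripts/maths_helpers.py that contains the same square_root() definition:

source_python("Scripts/maths_helpers.py") # Hypothetical file defining square_root()
square_root(10) # The function is now available directly as an R function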

Modelling with Python and R - with the help of reticulate

The first step is to do some data preparation and wrangling to get the data into the right format. We are going to make this a regression task, and I am going to try to predict the temperature based on some of the other collected variables.

Synthetically upsizing my data

You would not do this in practice, but I want the dataset to be larger, so I have created a function that rbinds the data onto itself a number of times:

# Create a bigger version of the air quality data set

make_blobs_of_blobs <- function (number_of_blobs, df){
  n <- number_of_blobs
  # Stack n copies of the data frame on top of each other
  do.call("rbind", replicate(n, df, simplify = FALSE))
}

Data Setup

I am now going to set up the data and use my custom function to upsize the data:

air <- airquality
air <- make_blobs_of_blobs(50, air) #Repeat the data 50 times and append to the end of the frame
#Exclude Month and Day, as we are going to pass the data to scikit-learn to fit a linear model
air %<>% 
  select(everything(), -c(Month, Day)) %>% 
  drop_na()

The data is now ready, has been upsized, the relevant fields selected and nulls removed from the data frame.

Splitting data

Next, I split the data into the predictors (features) and the predicted (target) variable:

# Predictor (X) and target (Y) splits
X <- air[,1:3]
Y <- data.frame(Temp=air[,4])

Casting to a Python object

The important command here is r_to_py(), which converts a data frame, or other R object, into the associated Python object (a pandas data frame, a numpy array, etc.). R handles this conversion for you.

I will cast the air data frame and the X and Y splits over to Python to use the train-test split functionality in Python.
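
As a quick sanity check on what the conversion produces, you can inspect the class of the converted object (the exact class strings may vary slightly between reticulate versions):

class(air) # "data.frame"
class(r_to_py(air)) # e.g. "pandas.core.frame.DataFrame" ... "python.builtin.object"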

py_air <- r_to_py(air)
py_X <- r_to_py(X)
py_Y <- r_to_py(Y)
py_air$head() # Call the head on the Python object
##    Ozone  Solar.R  Wind  Temp
## 0     41      190   7.4    67
## 1     36      118   8.0    72
## 2     12      149  12.6    74
## 3     18      313  11.5    62
## 4     23      299   8.6    65
py_air$dtypes
## Ozone        int32
## Solar.R      int32
## Wind       float64
## Temp         int32
## dtype: object
py_air$nunique # Note: without parentheses this returns the bound method rather than calling it
## <bound method DataFrame.nunique of       Ozone  Solar.R  Wind  Temp
## 0        41      190   7.4    67
## 1        36      118   8.0    72
## 2        12      149  12.6    74
## 3        18      313  11.5    62
## 4        23      299   8.6    65
## ...     ...      ...   ...   ...
## 5545     14       20  16.6    63
## 5546     30      193   6.9    70
## 5547     14      191  14.3    75
## 5548     18      131   8.0    76
## 5549     20      223  11.5    68
## 
## [5550 rows x 4 columns]>
py_air$describe()
##              Ozone      Solar.R        Wind         Temp
## count  5550.000000  5550.000000  5550.00000  5550.000000
## mean     42.099099   184.801802     9.93964    77.792793
## std      33.128722    90.748953     3.54197     9.487799
## min       1.000000     7.000000     2.30000    57.000000
## 25%      18.000000   112.000000     7.40000    71.000000
## 50%      31.000000   207.000000     9.70000    79.000000
## 75%      63.000000   256.000000    11.50000    85.000000
## max     168.000000   334.000000    20.70000    97.000000
#py_list_attributes(py_air)
py_len(py_air)
## [1] 5550

Using Python’s train and test split

I will now use sklearn's train_test_split function to split my data into training and test sets, for use with sklearn later on:

split <- sl_model_selection$train_test_split(X, Y, test_size=0.25)
#Tap into the model_selection sub module in sklearn to get train_test_split function

This returns a list of elements, as the result is held as a tuple in Python. Python conveniently allows multiple assignment, but R does not have that capability, so I have to index into the list to pick out the relevant data frames (train_test_split returns them in the order X_train, X_test, y_train, y_test):

py_X_train <- r_to_py(split[[1]])
py_X_test <- r_to_py(split[[2]])
py_Y_train <- r_to_py(split[[3]])
py_Y_test <- r_to_py(split[[4]])

py_X_train$head() #Use head method in Python
##       Ozone  Solar.R  Wind
## 5535   13.0    112.0  11.5
## 1488   48.0    260.0   6.9
## 3386   50.0    275.0   7.4
## 3824   16.0      7.0   6.9
## 1294   21.0    259.0  15.5
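
As an aside on the multiple-assignment point above, the zeallot package (not used elsewhere in this post) provides a destructuring operator that gets closer to Python's behaviour. A minimal sketch, assuming zeallot is installed:

library(zeallot)
c(X_train, X_test, Y_train, Y_test) %<-% split # Unpack the four list elements in one step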

Fitting a model in scikit-learn (Python's ML library)

The next steps fit a linear regression model in scikit-learn. Unlike R, scikit-learn requires you to instantiate the model object before fitting. The code below shows the process:

sk_lm_model <- skl_lm$LinearRegression() #Instantiate the linear regression model
model <- sk_lm_model$fit(py_X_train, py_Y_train) #Fit the model object to the training set - scikit-learn takes the features and target as separate inputs
r_squared <- model$score(py_X_test, py_Y_test) #For this model, the score is the R squared value, indicating how well the chosen predictors fit the temperature we are trying to predict
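
For comparison with the "Unlike R" point above, the equivalent single-step fit in base R needs no separate instantiation (shown purely for reference; the rest of this post sticks with the scikit-learn model):

r_lm_model <- lm(Temp ~ Ozone + Solar.R + Wind, data = air) # Fit in one call
summary(r_lm_model)$r.squared # R squared on the full data, not a held-out test set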

To access the model results we use the following code - this will bring back the intercept term and the coefficients:

model_intercept <- model$intercept_
model_coef <- model$coef_
print(model_intercept)
## [1] 72.52796
print(model_coef)
##           [,1]        [,2]       [,3]
## [1,] 0.1693316 0.006835935 -0.3102768

Making predictions with the model

To make predictions with the model we will use the testing set that we created when we used the sci-kit learn splitting function. This will allow us to validate the model fit visually:

model_predict <- model$predict(py_X_test)
#Create a data frame with the predictions
model_results <- data.frame(Predicted_Temp=model_predict, 
                            py_to_r(py_Y_test),
                            py_to_r(py_Y_test) - model_predict)
colnames(model_results) <- c("Predicted", "Actual", "Residual")

The output of model$predict() is converted to an R object automatically, but py_Y_test is still in a native Python format, so I need to use the reverse casting function py_to_r() to convert it back to an object that R can work with. If I tried to pass it directly without the conversion, I would get an error.

model_results is a data frame, and I then use the colnames() function in R to rename its columns, passing the new names as an R character vector.

Visualising the fit with Seaborn

I will now convert my model_results frame back to a Python format (a pandas data frame) so that seaborn can interact with the columns and rows in the data frame.

# Convert model results back to Python to do stuff with
py_mod_results <- r_to_py(model_results)
py_mod_results$dtypes
## Predicted    float64
## Actual       float64
## Residual     float64
## dtype: object

This prints out the data types of the Python object. In Python, the same code would be py_mod_results.dtypes; the dollar ($) accessor in R is replaced with a period (.).

Finally, we pass this data frame through to seaborn to plot:

#Create line plot in seaborn
sns$lineplot(data=py_mod_results, x="Actual", y="Predicted")
## AxesSubplot(0.125,0.11;0.775x0.77)
plt$savefig("Images/seaborn.png")
knitr::include_graphics("Images/seaborn.png")

The plot returned is a Python (matplotlib) plot, and the workflow varies slightly from R: I could call plt$show() directly after the code to view the chart, but that opens the figure in Python and it then cannot be embedded in the Markdown document, so instead I save it with plt$savefig() and read the image back in.

Create the same plot in R

I will now create a similar plot in R:

plot <- model_results %>% 
  ggplot(aes(x=Actual, 
             y=Predicted)) + geom_point(color="blue") +
  geom_smooth(method = 'lm', formula = "y ~ x") 

plotly::ggplotly(plot) # Convert to a plotly object

Running an external Python script

First, we need to write out the data from our R environment. I will use data.table to write it out quickly:

data.table::fwrite(air, "temperature_pred.csv")

The example below shows how to run an external Python script. This script picks up the data we have just written out and creates a seaborn pair plot:
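
The contents of Scripts/sns_plot.py are not reproduced here; as a rough sketch of what such a script might contain (expressed via py_run_string() so it can be tested from R, and assuming it reads the CSV written above):

py_run_string("
import pandas as pd
import seaborn as sns
air = pd.read_csv('temperature_pred.csv') # The file written out by fwrite() above
sns.pairplot(air) # Pair plot of every column against every other column
")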

# Run the external Python script - it picks up the data we wrote out and builds a pair plot
py_run_file("Scripts/sns_plot.py")
plt$savefig("Images/snspairplot.png")
knitr::include_graphics("Images/snspairplot.png")

This ran the external Python script and returned the chart object; I saved the figure to file (as R Markdown cannot render matplotlib plots directly) and then loaded the image back in to display it.

Creating a correlation matrix heatmap in Python

In the final example, I will demonstrate how to create a correlation heatmap in Python:

# Finally we will create a correlation matrix and plot it as a seaborn heatmap
corr <- py_air$corr()
plt$clf() #Clear the previous figure
sns$heatmap(corr, annot=TRUE, cmap="YlGnBu")
## AxesSubplot(0.0868538,0.076963;0.649444x0.903778)
plt$savefig("Images/correlation_plot.png")
knitr::include_graphics("Images/correlation_plot.png")

There is more you can do with reticulate, such as combining it with S3 methods, but for the purpose of passing structures back and forth, I find this approach works best.

Find out more

If you want to find the code, click the GitHub image below; the repositories will also be listed when the webinar is posted by the NHS-R community.